-
- >Darn good question. Your approach appears to have the correct
- >results, but I'm not sure it's practical for many implementations
- >(global search-and-replace operations are inconvenient for
- >sequential processing models), and it certainly isn't a healthy
- >way to think about SGML documents.
-
- But most browsers seem to have caching anyway, which means they can do
- a global search/replace. You can still do it more or less sequentially, though:
- just buffer runs of new-lines until you know what follows them, and
- then deal with them. No correct method you could propose avoids
- storing something somewhere.
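
A minimal sketch of that buffering, in Python. It assumes the rule quoted further down in this message (drop new-lines next to any tag other than PRE, A and PLAINTEXT). The tokenizer is deliberately naive, and the names `TOKEN`, `KEEP_AROUND` and `collapse` are made up for illustration:

```python
import re

# Tags whose surrounding new-lines are left alone (assumed from the quoted rule).
KEEP_AROUND = {"PRE", "A", "PLAINTEXT"}

# Naive tokenizer: tags, runs of new-lines, other text, or a stray "<".
# It does not understand quoted attribute values containing ">".
TOKEN = re.compile(r"</?[A-Za-z][^>]*>|\n+|[^<\n]+|<")
TAG_NAME = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)")

def strips_newlines(token):
    """True if the token is a tag whose neighbouring new-lines get removed."""
    m = TAG_NAME.match(token)
    return m is not None and m.group(1).upper() not in KEEP_AROUND

def collapse(tokens):
    """Process tokens one at a time, holding back only the current new-line run."""
    pending = ""         # buffered new-lines whose fate is not yet known
    prev_strips = False  # did the last non-new-line token strip new-lines?
    for tok in tokens:
        if tok.startswith("\n"):
            pending += tok          # cannot decide yet; wait for the next token
            continue
        cur_strips = strips_newlines(tok)
        if pending and not (prev_strips or cur_strips):
            yield pending           # neither neighbour strips: keep the run
        pending = ""
        yield tok
        prev_strips = cur_strips
    if pending and not prev_strips:
        yield pending               # trailing new-lines after ordinary text

def collapse_text(doc):
    return "".join(collapse(TOKEN.findall(doc)))
```

The only state carried between tokens is the pending new-line run and one flag, which is the "storing something somewhere" part. For example, `collapse_text("<H1>\nTitle\n</H1>\n\n<PRE>\ncode\n</PRE>\ntext\n")` keeps the new-lines around and inside the PRE section and drops the ones around the H1 tags.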
-
- >The way to think about SGML documents, IMHO, is this: the sequence
- >of characters comprising an SGML document are presented to an
- >SGML parser, which parses the markup from the data and passes
- >the "results" to the processing application.
-
- This is another alternative I considered. But I figured that I have to
- deal with various parsing things when I read the HTML anyway. I was
- just going to take each chunk of data (with anchors pre-processed out)
- and remove all whitespace at the beginning and end (except for PRE sections
- and such). But if someone put in whitespace, why should I muck with it?
- Who knows, they might have even wanted it there.
-
- >>1. For each tag NOT in
- >> <PRE> </PRE> <A> </A> <PLAINTEXT>
- >> remove ALL surrounding new-lines.
- >
- >First, let's get one thing straight: the PLAINTEXT element as
- >described by the original HTML documentation is not representable
- >in SGML. For my purposes, I consider the HTML document to
- >end at the <PLAINTEXT> tag, and I consider the rest of the
- >data stream to be an RFC-822 message body or a MIME text/plain body,
- >and not SGML at all.
-
- I hadn't meant otherwise. But you have to read it in anyway, and since
- my method deals with things prior to any other parsing, you treat it
- all as one clump.
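
For comparison, the quoted view (the HTML document ends at the PLAINTEXT tag and the rest is plain text) could be sketched like this in Python. The function name `split_at_plaintext` is hypothetical, and the pattern is naive: it would also match a tag-like string inside a quoted attribute value.

```python
import re

# Case-insensitive match for a <PLAINTEXT> start tag (naive, see note above).
PLAINTEXT_TAG = re.compile(r"<PLAINTEXT\b[^>]*>", re.IGNORECASE)

def split_at_plaintext(stream):
    """Return (html_part, plain_part); plain_part is None if no tag is found."""
    m = PLAINTEXT_TAG.search(stream)
    if m is None:
        return stream, None
    # Everything up to and including the tag is HTML; the rest is left untouched.
    return stream[:m.end()], stream[m.end():]
```

Either way the whole stream has to be read in; the split only decides which part gets any new-line processing.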
-
- >Next, let's keep in mind that you can't do things like the following
- >global substitution,
- >s/\n+(<(H1|H2|ADDRESS...))>/$2/g;
- >because it might find things that look like tags but aren't,
- >for example
- >
- ><foo bar="
- ><H1>this is a little kooky, but nonetheless legal and possible.">
- >
- >But even if you're using a proper SGML parser, consider:
- >
- ><H1>Here we go!
- ><a href="#xyz">click here</a>
- >There we went!
- ></H1>
- >
- >The parser will return an H1 start tag, and then the
- >string "Here we go!\n". At this point, your rule doesn't
- >tell me what to do with the newline. I have to get
- >the next object before I decide.
-
- Like I said before, you have to do some sort of storage at some point
- anyway.
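
The quoted objection about text that looks like a tag but isn't can be reproduced with a short Python sketch. The pattern below is written in the spirit of the quoted substitution rather than transcribed from it, and its tag list is cut down to three names for illustration:

```python
import re

# Drop new-lines that come directly before one of a few start tags.
# The tag list here is an illustrative subset, not a complete one.
NEWLINES_BEFORE_TAG = re.compile(r"\n+(<(?:H1|H2|ADDRESS)>)")

def strip_before_tags(text):
    return NEWLINES_BEFORE_TAG.sub(r"\1", text)

# An attribute value that happens to contain a new-line followed by "<H1>".
tricky = '<foo bar="\n<H1>this is legal">'
mangled = strip_before_tags(tricky)
```

Here `mangled` no longer equals `tricky`: the new-line inside the attribute value has been removed even though no H1 element is involved. A global substitution has no way of telling the two cases apart without doing at least some of the parsing.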
-
- >Hmm... I guess that's reasonable. But I'd rather just pass all the
-
- Like I said before, you have to do some sort of storage at some point
- anyway.
-
- >My point is: don't use whitespace to represent significant
- >information except in the PRE element. Use the tags that
- >are defined to have significance.
-
- I suppose I agree with this more or less, at least from the point of view
- of generating my own code. But we have to make something clear - can
- a browser keep all the whitespace if it wants to? Or in other words,
- can an HTML generator assume collapsing whitespace, or just be aware
- that it might happen?
-
- tom
-
-